Internal and external normalization of nascent RNA sequencing run-on experiments

In experiments with significant perturbations to transcription, nascent RNA sequencing protocols are dependent on external spike-ins for reliable normalization. Unlike in RNA-seq, these spike-ins are not standardized and, in many cases, depend on a run-on reaction that is assumed to have constant efficiency across samples. To assess the validity of this assumption, we analyze a large number of published nascent RNA spike-ins to quantify their variability across existing normalization methods. Furthermore, we develop a new biologically-informed Bayesian model to estimate the error in spike-in based normalization estimates, which we term Virtual Spike-In (VSI). We apply this method both to published external spike-ins as well as using reads at the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3^\prime$$\end{document}3′ end of long genes, building on prior work from Mahat (Mol Cell 62(1):63–78, 2016. 10.1016/j.molcel.2016.02.025) and Vihervaara (Nat Commun 8(1):255, 2017. 10.1038/s41467-017-00151-0). We find that spike-ins in existing nascent RNA experiments are typically under sequenced, with high variability between samples. Furthermore, we show that these high variability estimates can have significant downstream effects on analysis, complicating biological interpretations of results. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05607-3.


Characteristics of 3 ′ regions
For our n = 1198 genes with suitable length for a 180kb threshold, we find that the median length is 288254bp, the mean length is 371005bp, while the minimum length region is 180201bp and the longest region is 2220164bp.

The variance distribution for the VSI is highly skewed
In our testing, we observe that the typical estimated variance for normalization factors is around 1, corresponding to a 2-fold change.However, this estimated variance is noisy, with the 94% Highest Density Interval (HDI) of the variance lying somewhere between 0.10 and 2 in log 2 transformed space.This means the most likely variance on our normalization estimate is somewhere between 1.1-fold and 4-fold, which is a wide range.Because the estimated variance has a skewed distribution, our variance estimates are shifted to be relatively large because of the long tail in our data.

Analysis of Samples
For the sake of consistency of analysis, samples were only analyzed if they could be processed by our analysis pipeline without error or modification to the pipeline code.A full list of files that we used as input for our pipeline and if/why they were excluded can be found in Supplemental File 2. Samples were processed separately in groups of single-end and paired-end samples using the Nascent-Flow pipeline, then analyzed downstream.Once processed, samples were then grouped by experiment and analyzed using the VSI Nextflow pipeline.Two different analyses were performed -one on RefSeq annotated genes in the dm6 genome and one on RefSeq annotated genes constrained to a 180kb / 60 minute 3 ′ threshold.60 minutes was selected among all experiments for the sake of consistency as well as to test the assumptions made in the choice of the 3 ′ invariant region.A summary of the data sets used for this study can be found in Table S1.
Data used for this

MCMC convergence and autocorrelation
As discussed in the main text (See Figure S5), the convergence of model parameters is dependent on the sampler used.The NUTS sampler used for the continuous variables is more efficient in exploring the sampling space than Metropolis-Hastings, which is only used to sample from the two discrete

Experiment
Cell Type SRP Project GSE Accession Number Aoi 2020 [1] DLD-1 SRP247346 GSE144786 Barbieri 2020 [2] THP-1 SRP242477 GSE143844    Estimates of the negative binomial distributions (top half of plot) show autocorrelation for much longer than estimates of the mean and variance (bottom half of plot).Some degree of autocorrelation is expected when fitting count data under the assumption that the 3 ′ region is invariant, as the choice of an invariant region implies that values should be correlated across samples.Additionally, some degree of autocorrelation is likely unavoidable, because in count data both autocorrelation and overdispersion can have the same causes [15]

Figure S2 :
Figure S2: Comparison of internal 3 ′ normalization to exogenous spike-ins.This figure is the same as Figure 3 but each panel B-F now with error bars.

Figure S3 :Figure S4 :
FigureS3: The same as Figure5, but using the 40 minute for comparison instead of the 60 minute timepoint.

3 Figure S5 :
Figure S5: Parameter autocorrelation across 4 runs (chains) (columns, left to right) for a single pairwise comparison (SRR5364303 vs SRR5364304) corresponding to the Dukler 2017[5] 0 minute replicates 1 and 2. Estimates of the negative binomial distributions (top half of plot) show autocorrelation for much longer than estimates of the mean and variance (bottom half of plot).Some degree of autocorrelation is expected when fitting count data under the assumption that the 3 ′ region is invariant, as the choice of an invariant region implies that values should be correlated across samples.Additionally, some degree of autocorrelation is likely unavoidable, because in count data both autocorrelation and overdispersion can have the same causes[15]